Voice recognition method

ABSTRACT

The subject of the invention is a voice recognition procedure which comprises: (a) a step of decomposition of a digitised voice signal into a plurality of fractions, (b) a step of representation of each of the fractions by means of a representative vector $X_t$, (c) a step of classification of the representative vectors $X_t$, which comprises at least one multistep binary tree residual vectorial quantisation. In the classification step a phonetic representation is associated with each representative vector $X_t$, allowing a sequence of phonetic representations to be obtained.

DESCRIPTION

1. Field of the Invention

This invention concerns the sector of automatic voice recognition for extensive and continuous vocabularies and in a manner which is independent of the speaker.

The invention refers to a voice recognition procedure which comprises:

-   (a) a step of decomposing a digitised voice signal into a plurality of fractions,
-   (b) a step of representation of each of the fractions by a representative vector $X_t$, and
-   (c) a step of classification of the representative vectors $X_t$, in which each representative vector $X_t$ is associated with a phonetic representation, which allows a sequence of phonetic representations to be obtained.

The invention further refers to an information technology system which comprises an execution environment suitable for executing an information technology programme which comprises voice recognition means.

The invention also refers to an information technology programme which can be directly loaded in the internal memory of a computer and an information technology programme stored in a medium suitable for being used by a computer.

2. State of the art

In general, automatic voice recognition systems function in the following manner: in an initial step, the analog signal which corresponds to the acoustic pressure is captured with a microphone and is introduced into an analog/digital converter that will sample the signal with a given sampling frequency. The sampling frequency used is usually double the maximum frequency in the signal, which is approximately from 8 to 10 kHz for voice signals and 4 kHz for voice signals carried by telephone. Once digitised, the signal is divided into fractions of from 10 to 30 milliseconds duration, generally with some overlap between one fraction and the next.

A representative vector is calculated from each fraction, generally by means of a transform to the spectral plane using Fast Fourier Transform (FFT) or some other transform and subsequently taking a given number of coefficients of the transform. In most cases first and second order derivatives of the transformed signal are also used to better represent the variation over time of the signal. Currently, the use of cepstral coefficients is fairly widespread, which are obtained from a spectral representation of the signal subsequently subdivided into its Mel or Bark forms and adding delta and delta-delta coefficients. The details of such implementations are known and can be found for example in (1).

Once the representative vectors have been obtained, a classification or decoding process is applied to them to obtain a recognition of some subunit present in the voice signal: words, syllables or phonemes. This process is based on the modelling of the acoustic signal through techniques such as Hidden Markov Models (HMM), described in (2), Dynamic Time Warping (DTW), described in (3) or Hidden Dynamic Models (HDM), a recent example of which is (4). In all such systems a large amount of training data is used to train and calculate the optimal parameters of the model which will then be used to classify or decode the representative vectors of the voice signal which one wishes to recognise.

Currently, the most widespread systems are those that use Markov Models. In continuous speech frequent coarticulatory phenomena are produced, which cause the modification of the pronunciation characteristics of the phonemes, and even the disappearance of many of them in a continuous sequence. This, combined with the variability proper to the voice signal characteristics of every individual speaker, means that the rate of direct recognition of vocal subunits in a continuous voice signal and with unlimited vocabulary is relatively low. Most systems use phonemes as the principal vocal subunit, grouping them in groups of n (called n-grams) to be able to apply statistical information relative to the probabilities that a phoneme follows another in a given language, such as is described in (5). As shown in (5), the application of n-grams continues to be insufficient to obtain acceptable recognition rates, which is the reason for all advanced systems using language models which incorporate dictionaries with a high number of precoded words (typically between 60,000 and 90,000) and with information on the occurrence probabilities with respect to individual words and ordered combinations of words. Examples of such systems are (6) and (7). The application of these techniques significantly improves the recognition rate for individual words, although with the drawback of increased system complexity and a limitation of the system's generic use in situations where a significant number of words not found in the dictionary can occur.

SUMMARY OF THE INVENTION

The objective of the present invention is to overcome the above drawbacks. This aim is achieved by a voice recognition procedure such as indicated at the beginning of this specification, characterised in that said classification step comprises at least a multistep binary tree residual vectorial quantisation.

Another feature of the invention is an information technology system which comprises an execution environment suitable for executing an information technology programme which comprises voice recognition means through at least a multistep binary tree residual vectorial quantisation according to the invention.

A further feature of the invention is an information technology programme which can be directly loaded in the internal memory of a computer which comprises instructions suitable for performing a procedure according to the invention.

Finally, another feature of the invention is an information technology programme stored in a medium suitable for being used by a computer which comprises instructions suitable for performing a procedure according to the invention.

Preferably the classification step comprises at least two successive vectorial quantisations, and more preferably the classification step comprises a first vectorial quantisation suitable for classifying each of the representative vectors $X_t$ in a group of among 256 possible groups, and a second vectorial quantisation suitable for classifying each of the representative vectors $X_t$ classified within each of the 256 groups in a subgroup of among at least 4096 possible subgroups, and advantageously 16,777,216 possible subgroups, for each of the groups. A particularly advantageous embodiment of the invention is obtained when at least one of the vectorial quantisations is a multistep binary tree with symmetrical reflection residual vectorial quantisation.

Preferably the phonetic representation is a subphonic element, although in general the phonetic representation can be any known voice signal subunit (syllables, phonemes or subphonic elements).

Advantageously the digitised voice signal is decomposed into a plurality of fractions which are partially overlapped.

Another advantageous embodiment of the procedure according to the invention is obtained when subsequent to the classification step there is a segmentation step which allows the phonetic representations to be joined to form groups with a greater phonetic length. I.e., the sequence of subphonic elements which is being obtained is segmented into small fragments, and then the subphonic elements of the fragments obtained are grouped into phonemes within the same segment or fragment. Preferably the segmentation step comprises a search for groups of at least two subphonic elements which each comprise at least one auxiliary phoneme, and a grouping of the subphonic elements which are comprised between each pair of the groups, which form segments of subphonic elements.
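The search for boundary groups described above can be sketched as follows. This is only an illustrative reading: subphonic elements are modelled as `(left, right)` pairs of phoneme names, the auxiliary phoneme names are those listed in the table of the detailed description, and the function names, the `min_run` parameter and the handling of an isolated auxiliary element (it is kept inside the segment) are assumptions not taken from the text.

```python
# Auxiliary (fictitious) phonemes as named in the detailed description.
AUXILIARY = {"UH", "bSIL", "eSIL", "SIL", "INHALE",
             "EXHALE", "CLICK", "SMACK", "SWALLOW", "NOISE"}


def has_auxiliary(element):
    """True if either half of a (left, right) subphonic element is auxiliary."""
    left, right = element
    return left in AUXILIARY or right in AUXILIARY


def split_on_auxiliary(elements, min_run=2):
    """Return the runs of elements lying between boundary groups.

    A boundary group is a run of at least `min_run` consecutive elements
    that each contain an auxiliary phoneme (assumed reading of the text).
    Boundary elements themselves are dropped from the output.
    """
    segments, current, run = [], [], []
    for element in elements:
        if has_auxiliary(element):
            run.append(element)
            continue
        if run:
            if len(run) >= min_run:
                # A boundary group closes the segment collected so far.
                if current:
                    segments.append(current)
                current = []
            else:
                # Assumption: a lone auxiliary element does not split.
                current.extend(run)
            run = []
        current.append(element)
    if run and len(run) < min_run:
        current.extend(run)
    if current:
        segments.append(current)
    return segments


sequence = [("SIL", "SIL"), ("eSIL", "B"), ("B", "B"), ("B", "AE"),
            ("AE", "AE"), ("AE", "TD"), ("TD", "TD"), ("TD", "bSIL"),
            ("bSIL", "SIL"), ("eSIL", "K"), ("K", "K")]
segments = split_on_auxiliary(sequence)
```

With this toy input the sequence is cut at the two runs of silence-related elements, leaving one segment for each stretch of speech between them.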

It is particularly advantageous that the segmentation step comprise a step grouping subphonic elements into phonemes, in which the grouping step is performed with respect to each of the segments of subphonic elements and comprises the following substeps:

-   1. Starting from the sequence of segments of subphonic elements $\{\Phi_{j,m}^{t}\}$, $1 \le t \le L$, in which $L$ is the segment length.
-   2. Initialise $i = 1$.
-   3. Initialise $s = i$; $e = i$; $n_j = 0$; $n_m = 0$ for $1 \le j \le 60$; $1 \le m \le 60$.
-   4. If $\{j \in \Phi_{j,m}^{i}\} = \{j \in \Phi_{j,m}^{i+1}\}$, then $n_j = n_j + 1$; if $\{m \in \Phi_{j,m}^{i}\} = \{m \in \Phi_{j,m}^{i+1}\}$, then $n_m = n_m + 1$.
-   5. If $\{j \in \Phi_{j,m}^{i}\} \ne \{j \in \Phi_{j,m}^{i+1}\}$ and $\{m \in \Phi_{j,m}^{i}\} \ne \{m \in \Phi_{j,m}^{i+1}\}$, the following grouping is performed: $f = \operatorname{index\,max}\{n_j, n_m;\ 1 \le j \le 60;\ 1 \le m \le 60\}$; $\{\Phi_{j,m}^{t}\},\ s \le t \le e \rightarrow \Phi_f$. Then $i = i + 1$; if $i < L - 1$ return to substep 3, otherwise finalise the segmentation.
-   6. $i = i + 1$; if $i < L - 1$ return to substep 4, otherwise go to substep 5 and finalise the segmentation.

Basically, what takes place in this case is that the segments obtained are taken and the chains of subphonic elements are grouped into phonemes.

Preferably the procedure comprises a learning step in which at least one known digitised voice signal is decomposed into a phoneme sequence and each phoneme into a sequence of subphonic elements, and subsequently a subphonic element is assigned to each representative vector $X_t$ according to the following rules:

-   1. $\Phi_{k-1}, \Phi_k, \Phi_{k+1}, \ldots$ being the phoneme sequence, in which the phoneme $\Phi_k$ is produced in the time segment $[t_i^k, t_f^k]$, in correspondence with the sequence of representative vectors $\{X_t\}$.
-   2. The representative vectors $\{X_t\}$ are assigned to subphonic units according to the rule:

$$
X_t \rightarrow
\begin{cases}
/\Phi_{k-1}\_\Phi_k/, & t_i^k < t \le t_i^k + 0.2\,(t_f^k - t_i^k) \\
/\Phi_k\_\Phi_k/, & t_i^k + 0.2\,(t_f^k - t_i^k) < t \le t_i^k + 0.8\,(t_f^k - t_i^k) \\
/\Phi_k\_\Phi_{k+1}/, & t_i^k + 0.8\,(t_f^k - t_i^k) < t \le t_f^k
\end{cases}
$$

Generally the decomposition into phonemes of the digitised voice signal used in learning is performed manually, and the decomposition into subphonic elements can be performed automatically with the previous rules, starting from the manual decomposition into phonemes.

Advantageously the procedure comprises a step of reduction of the residual vectorial quantisation tree which comprises the following substeps:

-   1. An initial value is given to $p$ = number of steps.
-   2. The branches of the residual vectorial quantisations which are situated in step $p$ are taken, i.e., the vectors $c_{j^p}$ such that $\operatorname{length}(j^p) = p$.
-   3. If the vector $c_{j^{p-1}\_0}$ and the vector $c_{j^{p-1}\_1}$ are both associated with the same subphonic element $\Phi_{j,m}$, step $p$ is discarded and the subphonic element $\Phi_{j,m}$ is associated with the vector $c_{j^{p-1}}$.
-   4. If $p > 2$, $p = p - 1$ is taken and substep 2 is repeated.

This step of reduction of the residual vectorial quantisation tree is advantageously performed after the learning step.

Preferably the representative vector is of 39 dimensions, which are 12 normalised Mel-cepstral coefficients, the energy in logarithmic scale, and their first and second order derivatives.

BRIEF DESCRIPTION OF THE DRAWINGS

Other advantages and characteristics of the invention will become more apparent from the following description, which illustrates, in an entirely non-limiting capacity, some preferred embodiments of the invention, with reference to the appended drawings, in which:

FIG. 1 is a block diagram of a procedure according to the invention, and

FIG. 2 is a diagram of a step of recognition of phonetic representations.

DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION

This invention describes a method of automatic voice recognition for unlimited and continuous vocabularies and in a manner which is independent of the speaker. The method is based on the application of multistep vectorial quantisation techniques to carry out the segmentation and classification of the phonemes in a highly precise manner.

Specifically, the average energy of each fraction together with an ensemble of Mel-cepstral coefficients and the first and second derivatives of both the average energy and the cepstral vector are used to form a representative vector of each fraction in 39 dimensions. These vectors are passed through a first quantisation step formed by an 8 step binary tree vectorial quantiser which performs a first classification of the fraction, and which is designed and trained with classical vectorial quantisation techniques. The function of this first quantiser is simply to segment the fractions into 256 segments. For each of these segments a 24 step binary tree with symmetrical reflection vectorial quantiser is designed separately.

Thus, each vector will have been mapped to a binary string composed of 32 digits: 8 from the first segmentation, and 24 from the subsequent step. These binary strings are associated with phonetic representations during the vectorial quantiser training step.
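The 8 + 24 digit string can be pictured as a single 32-bit code. The sketch below is only one way of packing it, assuming the 8 group digits occupy the most significant positions; the function names and the bit ordering are illustrative choices, not taken from the text.

```python
GROUP_BITS = 8      # first quantiser: 2**8 = 256 groups
SUBGROUP_BITS = 24  # second quantiser: 24 binary decisions


def pack_code(group, path_bits):
    """Combine a group index and 24 binary decisions into one integer."""
    if not 0 <= group < 2 ** GROUP_BITS:
        raise ValueError("group index out of range")
    if len(path_bits) != SUBGROUP_BITS or any(b not in (0, 1) for b in path_bits):
        raise ValueError("expected 24 binary digits")
    sub = 0
    for bit in path_bits:
        sub = (sub << 1) | bit
    return (group << SUBGROUP_BITS) | sub


def unpack_code(code):
    """Split a 32-bit code back into (group index, list of 24 bits)."""
    group = code >> SUBGROUP_BITS
    sub = code & ((1 << SUBGROUP_BITS) - 1)
    bits = [(sub >> (SUBGROUP_BITS - 1 - k)) & 1 for k in range(SUBGROUP_BITS)]
    return group, bits


def as_binary_string(code):
    """Render the code as the 32-digit binary string described in the text."""
    return format(code, "032b")


code = pack_code(5, [1] + [0] * 23)
```

A table keyed by such codes (for example a dictionary from code to phonetic representation) would then play the role of the association built during training.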

From this point on, each vector is decoded by performing this process and taking the phonetic representation associated with the resulting binary string. Words are recognised by performing string-matching between the resulting sequences of phonetic representations, using a new phonetic distance formula.

All vectors from the dictionaries can be stored in 75 Mb of memory, and each decoding requires the calculation of a maximum of 32 vectorial distortions. Having the dictionaries in memory, the entire process can be performed in real time with a PC of moderate power.

The individual phoneme recognition rates obtained are higher than 90%, allowing high precision word recognition without needing a prior word dictionary, and with far lower calculation complexity.

The procedure according to the invention is illustrated in FIG. 1. The original voice signal which, in general, can be audio or video, can be analog (AVA) or digital (AVD). If the voice signal is not originally in digital format, it must first pass through a sampling and quantisation step (10). Once in digital format, the signal passes through an acoustic pre-processing block (20) to obtain a series of representative vectors which describe the characteristics most important for phonetic recognition. These vectors are subsequently processed in the phonetic recognition step (30), where they are compared with the libraries of subphonic elements (40) to obtain the sequence of elements which most closely approximates the sequence of input vectors. Finally, the sequence of subphonic elements thus obtained is segmented (50) and reduced to a simpler phonetic representation, and saved in a database (60) so that retrieval can be carried out efficiently when performing a search. The above steps are described in greater detail below.

Step 10 corresponds to a conventional analog/digital converter. The voice signal coming from a microphone, video tape or any other analog means is sampled at a frequency of 11 kHz and with a resolution of 16 bits. The sampling can be performed at any other frequency (for example, 16 kHz) without any restriction in scope. In fact, the choice of optimum sampling frequency will depend on the system application: for applications in which the original sound comes from recording studios or is a high quality recording a sampling frequency of 16 kHz is preferable since it will allow the representation of a greater range of frequencies from the original voice signal. On the other hand, if the original sound comes from bad quality recordings (conferences, recordings with a PC microphone, multimedia files coded to low resolution for transmission by Internet, . . . ) it would be better to use a lower sampling frequency, 11 kHz, which would reduce part of the background noise or which can be closer to the frequency of the original signal (the case of files coded for transmission by Internet, for example). Naturally, if a given sampling frequency is chosen the entire system must be trained with this frequency.

Step 20 corresponds to the acoustic pre-processing. The aim of this step is to transform the sequence of original samples of the signal into a sequence of representative vectors which represent characteristics of the signal, which allow better modelling of the phonetic phenomena and which are not so correlated with respect to one another. The representative vectors are obtained as follows:

-   1. The sequence of original signal samples is structured into fractions corresponding to 30 msec. of signal. The fractions are taken every 10 msec., which is to say that they overlap. To eliminate part of the undesirable effects produced by the fraction overlap, these are weighted with a Hamming window.
-   2. A pre-emphasis filter is applied to the fractions with the following transfer function: $H(z) = 1 - 0.97\,z^{-1}$
-   3. For each fraction, a vector of 12 Mel-cepstral coefficients is calculated: $x_t(k)$, $1 \le k \le 12$, in which $t$ corresponds to the fraction number, which is to say, it is taken every 10 msec.
-   4. The Mel-cepstral coefficient vector is normalised with respect to the fraction average present in the phrase. Since the exact duration of the phrase is not known, and neither are the beginning and end points, an average duration of 5.5 sec. is taken, and thus the normalisation is:

$$\mu_t(k) = \frac{1}{550} \sum_{i=t-550}^{t} x_i(k), \qquad 1 \le k \le 12$$

$$\bar{x}_t(k) = x_t(k) - \mu_t(k), \qquad 1 \le k \le 12$$

-   5. The first and second order derivatives of the normalised vector are also taken:

$$\Delta\bar{x}_t(k) = \bar{x}_{t+2}(k) - \bar{x}_{t-2}(k), \qquad 1 \le k \le 12$$

$$\Delta\Delta\bar{x}_t(k) = \Delta\bar{x}_{t+1}(k) - \Delta\bar{x}_{t-1}(k), \qquad 1 \le k \le 12$$

-   6. Finally, the energy in logarithmic scale of each fraction is normalised with respect to its maximum value and the first and second order derivatives are taken:

$$\bar{x}_t(0) = x_t(0) - \max\{x_i(0)\}$$

$$\Delta\bar{x}_t(0) = \bar{x}_{t+2}(0) - \bar{x}_{t-2}(0)$$

$$\Delta\Delta\bar{x}_t(0) = \Delta\bar{x}_{t+1}(0) - \Delta\bar{x}_{t-1}(0)$$

Thus, the representative vector $X_t$ has 39 dimensions and is formed by:

$$X_t = \left(\{\bar{x}_t(k)\},\ \{\Delta\bar{x}_t(k)\},\ \{\Delta\Delta\bar{x}_t(k)\},\ \bar{x}_t(0),\ \Delta\bar{x}_t(0),\ \Delta\Delta\bar{x}_t(0)\right), \qquad 1 \le k \le 12$$
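As an illustration of how the 39 components fit together, the following sketch starts from already computed Mel-cepstral coefficients and log energies (their calculation is not shown) and applies the normalisation and difference formulas given above. The function names are hypothetical, and two details are assumptions not fixed by the text: indices are clamped at the ends of the sequence, and near the start the mean is taken over the fractions actually available rather than a fixed 550.

```python
def running_mean_normalise(cepstra, window=550):
    """Subtract from each coefficient its mean over the preceding fractions."""
    normalised = []
    for t, vector in enumerate(cepstra):
        history = cepstra[max(0, t - window):t + 1]  # assumption: shorter at start
        mean = [sum(column) / len(history) for column in zip(*history)]
        normalised.append([x - m for x, m in zip(vector, mean)])
    return normalised


def delta(sequence, span):
    """sequence[t + span] - sequence[t - span], indices clamped at the edges."""
    last = len(sequence) - 1
    result = []
    for t in range(len(sequence)):
        ahead = sequence[min(t + span, last)]
        behind = sequence[max(t - span, 0)]
        result.append([a - b for a, b in zip(ahead, behind)])
    return result


def representative_vectors(cepstra, log_energy):
    """Assemble 39-dimensional vectors: 12 + 12 + 12 cepstral terms, 3 energy terms."""
    c = running_mean_normalise(cepstra)
    peak = max(log_energy)
    e = [[x - peak] for x in log_energy]       # energy relative to its maximum
    dc, de = delta(c, 2), delta(e, 2)          # first differences use t+2 and t-2
    ddc, dde = delta(dc, 1), delta(de, 1)      # second differences use t+1 and t-1
    return [c[t] + dc[t] + ddc[t] + e[t] + de[t] + dde[t] for t in range(len(c))]


# Toy input: 10 fractions whose coefficients simply grow with time.
toy_cepstra = [[float(t)] * 12 for t in range(10)]
toy_energy = [float(t) for t in range(10)]
vectors = representative_vectors(toy_cepstra, toy_energy)
```

The component order inside each vector follows the listing above: normalised coefficients, their first and second differences, then the three energy terms.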

Details as to the calculations necessary for the acoustic pre-processing step can be found in (8), (1).
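The first two substeps of the pre-processing (30 msec. fractions taken every 10 msec., Hamming weighting and the pre-emphasis filter) can be sketched as below. The sketch follows the order in which the text lists them, window first and filter second. The function name, the exact rate of 11000 samples per second and the treatment of the first sample of each fraction (left unfiltered) are assumptions.

```python
import math


def frame_signal(samples, sample_rate=11000, frame_ms=30, hop_ms=10, alpha=0.97):
    """Cut a signal into overlapping, Hamming-weighted, pre-emphasised fractions."""
    frame_len = sample_rate * frame_ms // 1000   # 330 samples at 11 kHz
    hop = sample_rate * hop_ms // 1000           # 110 samples at 11 kHz
    # Standard Hamming window of length frame_len.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    fractions = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        windowed = [w * x for w, x in zip(window, samples[start:start + frame_len])]
        # H(z) = 1 - alpha * z^-1 applied inside the fraction.
        emphasised = [windowed[0]] + [windowed[n] - alpha * windowed[n - 1]
                                      for n in range(1, frame_len)]
        fractions.append(emphasised)
    return fractions


# 100 ms of a constant toy signal.
fractions = frame_signal([1.0] * 1100)
```

Each returned fraction would then be passed to the Mel-cepstral analysis, which is outside the scope of this sketch.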

Recognition of phonetic representations is performed in step 30. The objective of this step is to associate the input representative vector $X_t$ with a subphonic element. Each individual person speaks at a different speed, and in addition the speed (measured in phonemes per second) of each individual also varies depending on his state of mind or level of anxiety. In general, the speed of speech varies between 10 and 24 phonemes per second. Since the representative vectors $X_t$ are calculated every 10 ms, there are 100 vectors per second, and thus each individual phoneme is represented by between 4 and 10 representative vectors. This means that each representative vector $X_t$ represents a subphonic acoustic unit, since its duration will be less than that of an individual phoneme.
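The figure of 4 to 10 vectors per phoneme follows directly from the stated rates, writing $n$ for the number of vectors per phoneme and $r$ for the speech rate (both symbols are introduced here only for the illustration):

```latex
n = \frac{100\ \text{vectors/s}}{r\ \text{phonemes/s}}, \qquad
r = 10 \;\Rightarrow\; n = 10, \qquad
r = 24 \;\Rightarrow\; n \approx 4.2
```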

Each language can be described with an ensemble of phonemes of limited size. The English language, for example, can be described with about 50 individual phonemes. An additional 10 fictitious phonemes are added to represent different types of sound present in the signal:

Phoneme      Represents
/UH/         Interjections such as uh, ah, em, er, . . .
/bSIL/       Beginning of a silence
/eSIL/       End of a silence
/SIL/        Middle of a silence
/INHALE/     Inhalation
/EXHALE/     Exhalation
/CLICK/      Sound produced by the tongue
/SMACK/      Click produced by the tongue
/SWALLOW/    Swallow saliva
/NOISE/      Any other type of unclassified noise

which forms a total of 60 phonetic units. Any other number of phonemes can be used without any restriction in scope.

During continuous speech frequent coarticulatory phenomena are also produced, in which the pronunciation of each phoneme is affected by the phonemes which immediately precede or follow it, since the human articulatory system cannot change position immediately between the pronunciation of one phoneme and the next. This effect is modelled by taking binary combinations of the sixty original phonemes, to form a classificatory ensemble comprising the 3600 possible combinations, although many of them never arise in practice. Taking binary combinations is enough, since the model works on the subphonic level.

By way of example, an individual pronunciation of the English word "bat" could be represented in the system on exiting step 30 by a sequence such as:

. . . , /eSIL_B/, /B_B/, /B_B/, /B_AE/, /B_AE/, /AE_AE/, /AE_AE/, /AE_AE/, /AE_AE/, /AE_AE/, /AE_TD/, /AE_TD/, /TD_TD/, /TD_TD/, /TD_TD/, /TD_bSIL/, /TD_bSIL/, . . .

FIG. 2 shows a functional diagram of step 30. The sequence of representative vectors $\{X_t\}$ passes firstly through an 8 step binary tree vectorial quantiser (100), which carries out a first classification of the acoustic signal. On exiting this quantiser representative vectors are obtained which are unchanged with respect to the input, but classified into 256 different groups, i.e., a first partition of the space of $\{X_t\}$ into regions and a partition of the input sequence into $\{X_{t \to i}\}$, $1 \le i \le 256$, is performed. The vectorial quantiser of 100 is designed and trained in a conventional manner according to the original algorithm presented in (9), using the Euclidean distance in the 39-dimension space to calculate distortion.
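A binary tree search of this kind can be sketched in a few lines. The `Node` class and function names are hypothetical, and the toy tree has only two levels and two dimensions; with 8 levels the same descent would yield one of 256 group indices. How the tree itself is trained is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Node:
    """One node of a binary tree quantiser (illustrative structure)."""

    def __init__(self, centroid, left=None, right=None):
        self.centroid = centroid
        self.left = left
        self.right = right


def tree_search(root, x):
    """Descend from the root, choosing the nearer child at every level."""
    bits = []
    node = root
    while node.left is not None and node.right is not None:
        if euclidean(x, node.left.centroid) <= euclidean(x, node.right.centroid):
            bits.append(0)
            node = node.left
        else:
            bits.append(1)
            node = node.right
    return bits


def group_index(bits):
    """Read the list of decisions as a binary number."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


toy_tree = Node(
    [5.0, 0.0],
    left=Node([0.0, 0.0], Node([0.0, -1.0]), Node([0.0, 1.0])),
    right=Node([10.0, 0.0], Node([10.0, -1.0]), Node([10.0, 1.0])),
)
path = tree_search(toy_tree, [9.0, 2.0])
```

The vector itself is left unchanged by the search; only the index of the group it falls into is recorded, which matches the role of the first quantiser described above.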

To train the vectorial quantiser 100 and in general the whole system, 300 hours of audio of different origins and with different speakers were prepared, manually segmented and annotated to the level of phonemes. Since initial recognition is performed at the level of subphonic units, the decomposition of the phonemes into subphonic units for the training sequence was performed according to the following rules:

-   1. Supposing the phoneme sequence $\ldots, \Phi_{k-1}, \Phi_k, \Phi_{k+1}, \ldots$ is taken, in which the phoneme $\Phi_k$ is produced in the time segment $[t_i^k, t_f^k]$ and $t$ is in units of 10 ms, in correspondence with the sequence of representative vectors $\{X_t\}$.
-   2. The representative vectors $\{X_t\}$ are assigned to subphonic units according to the rule:

$$
X_t \rightarrow
\begin{cases}
/\Phi_{k-1}\_\Phi_k/, & t_i^k < t \le t_i^k + 0.2\,(t_f^k - t_i^k) \\
/\Phi_k\_\Phi_k/, & t_i^k + 0.2\,(t_f^k - t_i^k) < t \le t_i^k + 0.8\,(t_f^k - t_i^k) \\
/\Phi_k\_\Phi_{k+1}/, & t_i^k + 0.8\,(t_f^k - t_i^k) < t \le t_f^k
\end{cases}
$$
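The 20 % / 60 % / 20 % rule can be written out as a short function. In this sketch the annotated phonemes are given as `(name, t_start, t_end)` triples in units of 10 ms; the function name, this input format and the choice of repeating the phoneme itself at the two ends of the sequence (where no neighbour exists) are assumptions. Integer arithmetic is used for the 0.2 and 0.8 thresholds to avoid rounding effects.

```python
def label_fractions(phonemes):
    """Map each fraction index t to a subphonic label such as '/B_AE/'."""
    labels = {}
    for k, (name, t_i, t_f) in enumerate(phonemes):
        previous_name = phonemes[k - 1][0] if k > 0 else name
        next_name = phonemes[k + 1][0] if k + 1 < len(phonemes) else name
        duration = t_f - t_i
        for t in range(t_i + 1, t_f + 1):            # t_i < t <= t_f
            offset = t - t_i
            if 5 * offset <= duration:               # first 20 % of the phoneme
                labels[t] = f"/{previous_name}_{name}/"
            elif 5 * offset <= 4 * duration:         # middle 60 %
                labels[t] = f"/{name}_{name}/"
            else:                                    # last 20 %
                labels[t] = f"/{name}_{next_name}/"
    return labels


annotation = [("eSIL", 0, 10), ("B", 10, 20), ("AE", 20, 30),
              ("TD", 30, 40), ("bSIL", 40, 50)]
labels = label_fractions(annotation)
```

For the toy annotation, the ten fractions of /B/ receive two /eSIL_B/ labels, six /B_B/ labels and two /B_AE/ labels, in the same pattern as the "bat" sequence shown earlier.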

Other examples of algorithms based on tree vectorial quantisers to calculate probability functions or for recognition tasks can be found in patents (10) and (11), although the algorithms and procedures are very different from those described in the present invention.

Once the $\{X_t\}$ space has been segmented, step 110 performs the classification of the representative vectors $\{X_{t \to i}\}$ into subphonic units. With the model adopted, on exiting 110 a subphonic sequence $\{\Phi_{j,m}^{t}\}$ will have been obtained such that $\Phi_{j,m} = /\Phi_j\_\Phi_m/$, $1 \le j \le 60$; $1 \le m \le 60$, represents the recognised subphonic units.

Step 110 represents a 24 step residual vectorial quantiser. In general, $X^1$ is defined as a $k$-dimensional random vector with probability distribution function $F_{X^1}(\cdot)$. A $k$-dimensional vectorial quantiser (VQ) is described by the triplet $(C, Q, P)$, in which $C$ represents the dictionary of vectors, $Q$ the association function and $P$ the partition. If the quantiser has $N$ vectors, its functioning is such that, given a realisation $x^1$ of $X^1$, the quantised vector $Q(x^1)$ is the vector $c_i \in C$, $1 \le i \le N$, for which the distance between $x^1$ and $c_i$ is the smallest among all $c_i \in C$, $1 \le i \le N$. Which is to say, the vector $c_i$ is taken which is the best approximation to the input vector $x^1$ for a given distance function, generally the Euclidean distance. The partition $P$ segments the space into $N$ regions, and the vectors $C$ are the centroids of their respective regions. The problem of the calculation of the triplet $(C, Q, P)$ so that the VQ with $N$ vectors is the best approximation to a given random vector $X^1$ can be resolved generically with the LBG algorithm (12), which in essence is based on providing a sufficiently long training sequence representative of $X^1$ and successively optimising the position of the vectors $C$ in the partitions $P$ and subsequently the position of the partitions $P$ with respect to the vectors $C$ until achieving a minimum of distortion on the training sequence. The VQ of step 100 represents a particular case of binary tree organised VQ such that quantisation is less expensive in terms of the complexity of the calculation. In the present case, $X^1$ is considered the representative vector $X_t$ and to each vector $c_i \in C$, $1 \le i \le N$, is associated a subphonic representation $\Phi_{j,m}$, so that the VQ carries out a recognition of the input representative vectors $\{X_t\}$.
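The nearest-vector rule and the alternating optimisation just described can be illustrated with a full-search quantiser and the centroid update at the heart of LBG-type training. This is a minimal sketch, not the training procedure of the invention: the codebook splitting and stopping criterion of the full algorithm are omitted, the iteration count is fixed, an empty cell simply keeps its vector, and all names and the toy labels are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (same ordering as the Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(x, codebook):
    """Index of the codebook vector nearest to x (full search)."""
    return min(range(len(codebook)), key=lambda i: squared_distance(x, codebook[i]))


def refine(training, codebook):
    """One pass: partition the training set, then move vectors to cell centroids."""
    cells = [[] for _ in codebook]
    for x in training:
        cells[quantise(x, codebook)].append(x)
    updated = []
    for vector, cell in zip(codebook, cells):
        if cell:
            updated.append([sum(column) / len(cell) for column in zip(*cell)])
        else:
            updated.append(list(vector))   # assumption: empty cells are left alone
    return updated


def train(training, codebook, iterations=20):
    """Repeat the refinement a fixed number of times."""
    for _ in range(iterations):
        codebook = refine(training, codebook)
    return codebook


def classify(x, codebook, labels):
    """Recognition: return the label attached to the nearest codebook vector."""
    return labels[quantise(x, codebook)]


toy_training = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
toy_codebook = train(toy_training, [[1.0, 1.0], [8.0, 1.0]])
toy_label = classify([9.0, 1.0], toy_codebook, ["/B_B/", "/AE_AE/"])
```

Attaching a label to each codebook vector is what turns the quantiser into a recogniser, as the last sentence of the paragraph above describes.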

Naturally, it is desirable to have a VQ with the greatest $N$ possible and with moderate calculation complexity to attain the best approximation to $\{X_t\}$. The problem is that a greater $N$ also increases the length of the training sequence necessary and the complexity of the training of the VQ and subsequently of the decoding and recognition. One solution is the use of residual vectorial quantisers (RVQ). An RVQ of $P$ steps is formed by an ensemble of $P$ VQ's, $\{(C^p, Q^p, P^p);\ 1 \le p \le P\}$, ordered so that for a realisation $x^1$ of $X^1$, the VQ $(C^1, Q^1, P^1)$ quantises the vector $x^1$ and the remaining steps $(C^{p+1}, Q^{p+1}, P^{p+1})$ quantise the residual vectors $x^{p+1} = x^p - Q^p(x^p)$ of the previous step $(C^p, Q^p, P^p)$, for $1 \le p < P$. Each dictionary $C^p$ contains $N^p$ vectors. Both the vectors of $C^p$ and the cells of $P^p$ are indexed with the subindex $j^p$, where $j^p \in J^p = \{0, 1, \ldots, N^p - 1\}$. The multistep index $\mathbf{j}^P$ is the $P$-tuple formed by the concatenation of the individual indexes $j^p$ of each step and represents the course through all the RVQ, i.e., $\mathbf{j}^P = (j^1, j^2, \ldots, j^P)$, and the quantised vector $\hat{x}^1$ is obtained as the sum of the vectors quantised in each step:

$$\hat{x}^1 = \sum_{p=1}^{P} c^{p}_{j^p}$$
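The step-by-step quantisation of residuals can be sketched as follows. Each step is represented here by a plain list of vectors; the function names and the toy one-dimensional-like codebooks are illustrative assumptions, and the tree organisation and symmetric reflection used in the invention are not modelled.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, codebook):
    """Index of the codebook vector closest to x."""
    return min(range(len(codebook)), key=lambda i: squared_distance(x, codebook[i]))


def rvq_encode(x, steps):
    """Quantise x step by step; return the multistep index and the reconstruction.

    Each step quantises the residual left by the previous one, and the
    reconstruction is the sum of the vectors chosen in every step.
    """
    residual = list(x)
    reconstruction = [0.0] * len(x)
    indices = []
    for codebook in steps:
        j = nearest(residual, codebook)
        chosen = codebook[j]
        indices.append(j)
        reconstruction = [r + c for r, c in zip(reconstruction, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return indices, reconstruction


# Three toy steps with two vectors each (N^p = 2), so 2 * 2 * 2 = 8 outputs.
toy_steps = [
    [[-4.0, 0.0], [4.0, 0.0]],
    [[-2.0, 0.0], [2.0, 0.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
toy_indices, toy_reconstruction = rvq_encode([5.0, 0.0], toy_steps)
```

In the toy case only six vectors are stored, yet eight distinct reconstructions are reachable, which is the economy that motivates the residual structure.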

The advantage of the RVQ's is that each individual VQ will have $N^p$ vectors, for which a sufficiently long training sequence can be obtained, but the total RVQ will have $N = \prod_{p=1}^{P} N^p$ vectors, a much greater number that could not be trained directly. Additionally, if each of the steps is a tree structured VQ, the complexity of the total decoding will be low.
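With the values used elsewhere in this description (24 binary steps, so $N^p = 2$, following a first classification into 256 groups), the product works out as:

```latex
N = \prod_{p=1}^{P} N^{p} = 2^{24} = 16\,777\,216 \quad (P = 24,\ N^{p} = 2),
\qquad 256 \cdot 2^{24} = 2^{8} \cdot 2^{24} = 2^{32}
```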

However, the problem of the RVQ's is that the quality of the quantisation obtained with a total of $N$ vectors is far inferior to that of a normal VQ or tree structured VQ of a single step with the same $N$ vectors, which is the reason why they have not been widely used. It has been shown in (13) that the loss of quality is due to the fact that each of the steps is optimised with the LBG algorithm separately from the rest, and the decoding is also carried out separately in each of the steps, so that the accumulated effect of the decisions taken in the steps subsequent to a given step can cause the resulting quantised vector to be outside the partition cell selected in the first step. If in addition the VQ's in each of the steps have a tree structure, the result will be that many of the vector combinations of the different steps will be effectively inaccessible.

Step 110 has been designed from an algorithm proposed in (13) to construct RVQ's based on binary trees with symmetric reflection. Specifically, a 24 step binary tree with symmetric reflection RVQ is used. The ensemble of steps $\{(C^p, Q^p, P^p);\ 1 \le p \le P\}$ with $P = 24$ which form the RVQ is represented in FIG. 1 as element 40. Naturally, a design based on a different number of steps could be used without any restriction in scope.

At this point $2^{32}$ different vectors are being handled, each one associated with a phonetic subelement of which there will be $60^2$ different values. In addition, in reality it is known that many of the possible $\Phi_{j,m}$ combinations will not be produced. Thus, each $\Phi_{j,m}$ combination actually obtained will correspond to a great number of possible vectors $c_{j^p}$ of the RVQ in 40. To reduce the quantity of memory necessary and the complexity of the decoding, the following RVQ tree reduction algorithm is applied once the RVQ has been trained:

-   1. Initialise $p = 24$.
-   2. The branches of the RVQ which are situated in step $p$ are taken, i.e., the vectors $c_{j^p}$ such that $\operatorname{length}(j^p) = p$.
-   3. If the vector $c_{j^{p-1}\_0}$ and the vector $c_{j^{p-1}\_1}$ are both associated with the same subphonic element $\Phi_{j,m}$, step $p$ is discarded and the subphonic element $\Phi_{j,m}$ is associated with the vector $c_{j^{p-1}}$.
-   4. If $p > 2$, $p = p - 1$ is taken and substep 2 is repeated.

In point 2, $j^p$ is the binary index of the vector within the binary vectorial quantiser, its length thus representing the level of the tree to which it has descended, i.e., which residual quantisation step it is at, the quantisers being multistep binary tree with symmetrical reflection quantisers.

With respect to point 3 it should be taken into account that since it is a binary tree, in each step there are two vectors, which are labelled with the subindexes _0 and _1. To define the quantised vector this distinction is not necessary, because the value of the corresponding bit in the index already marks which of the two has been chosen. At this point what is being looked at is whether the two vectors correspond to the same subphonic element; if they do, this step is no longer necessary because whatever its resolution the same classification will be arrived at.

This algorithm allows the number of associations to be saved to be reduced by a factor of approximately $2^9$, and in addition also reduces the number of comparisons to perform when decoding, without any loss in recognition precision.
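The reduction can be sketched on a toy tree. Here the associations are held in a dictionary from binary index strings to subphonic labels, which is an assumed representation (the text does not say how they are stored); the function names and the depth-3 example are likewise hypothetical. Sibling entries with the same label are replaced by a single entry on their parent, level by level from the deepest step upwards.

```python
def reduce_tree(labels, depth):
    """Merge sibling entries that share a label, from the deepest step upwards."""
    reduced = dict(labels)
    for p in range(depth, 1, -1):                  # p = depth, ..., 2
        parents = {index[:-1] for index in reduced if len(index) == p}
        for parent in parents:
            first = reduced.get(parent + "0")
            second = reduced.get(parent + "1")
            if first is not None and first == second:
                # Step p adds nothing here: keep one label on the parent.
                del reduced[parent + "0"]
                del reduced[parent + "1"]
                reduced[parent] = first
    return reduced


def lookup(reduced, index):
    """Decode by descending only until a stored prefix is reached."""
    for n in range(1, len(index) + 1):
        if index[:n] in reduced:
            return reduced[index[:n]]
    raise KeyError(index)


full = {"000": "A", "001": "A", "010": "A", "011": "A",
        "100": "B", "101": "C", "110": "C", "111": "C"}
small = reduce_tree(full, 3)
```

In the toy case eight associations shrink to four, and every index still decodes to the same label while often needing fewer comparisons, which mirrors the two benefits stated above.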

Exiting step 110 there is then the sequence of recognised subphonic elements $\{\Phi_{j,m}^{t}\}$. In step 50 the segmentation of the subphonic elements is carried out. Initially, the sequence of subphonic elements $\{\Phi_{j,m}^{t}\}$ is segmented performing a detection of the subphonic elements which incorporate one of the auxiliary phonemes represented in the table given above. The original sequence will be segmented whenever two consecutive subphonic elements comprise one of the auxiliary phonemes. These segments will provide a first estimation of words, although the segmentation will not be overly precise and several groups of joined words will be obtained in the same segment. Subsequently the following algorithm is used to group the subphonic elements into phonemes:

-   -   1. Starting from the segmented sequence of subphonic elements $\{\Phi_{j,m}^{t}\},\ 1 \le t \le L$, in which L is the segment length.
    -   2. Initialise $i=1$.
    -   3. Initialise $s=i;\ e=i;\ n_{j}=0;\ n_{m}=0$ for $1 \le j \le 60;\ 1 \le m \le 60$.
    -   4. $\text{If}\quad\begin{cases} \{j \in \Phi_{j,m}^{i}\} = \{j \in \Phi_{j,m}^{i+1}\}; & n_{j} = n_{j} + 1 \\ \{m \in \Phi_{j,m}^{i}\} = \{m \in \Phi_{j,m}^{i+1}\}; & n_{m} = n_{m} + 1 \end{cases}$
    -   5. If $\{j \in \Phi_{j,m}^{i}\} \ne \{j \in \Phi_{j,m}^{i+1}\}$ and $\{m \in \Phi_{j,m}^{i}\} \ne \{m \in \Phi_{j,m}^{i+1}\}$ the following grouping is performed: $f = \operatorname{index\,max}\{n_{j}, n_{m};\ 1 \le j \le 60;\ 1 \le m \le 60\}$; $\{\Phi_{j,m}^{t}\};\ s \le t \le e \rightarrow \Phi_{f}$. Then i=i+1; if i<L−1 return to substep 3, otherwise finalise the segmentation.
    -   6. i=i+1; if i<L−1 return to substep 4, otherwise go to substep 5 and finalise the segmentation.

This algorithm groups the subphonic elements into phonemes, which are the elements which finally will be saved in the database. This allows reduction of the amount of information to be stored in the database by a factor in the order of 6 to 9, which facilitates subsequent processing.
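The two stages of step 50 can be sketched in Python as follows. This is one possible reading, not the procedure as specified. It assumes that each subphonic element is a pair (j, m) of phoneme numbers, that a run of two or more consecutive elements containing an auxiliary phoneme acts as a separator, and that a group with no counted repetition falls back to its last j. The function names and the auxiliary phoneme number 0 are hypothetical.

```python
from collections import Counter


def split_at_auxiliary(sequence, auxiliary):
    """Cut the sequence wherever two or more consecutive elements
    contain an auxiliary phoneme; those elements are dropped."""
    is_aux = [j in auxiliary or m in auxiliary for j, m in sequence]
    segments, current = [], []
    for i, element in enumerate(sequence):
        prev_aux = i > 0 and is_aux[i - 1]
        next_aux = i + 1 < len(sequence) and is_aux[i + 1]
        if is_aux[i] and (prev_aux or next_aux):
            if current:  # close the segment before the separator
                segments.append(current)
                current = []
        else:
            current.append(element)
    if current:
        segments.append(current)
    return segments


def _dominant(counts, element):
    # Assumption: with no counted repetition, fall back to the j index.
    return counts.most_common(1)[0][0] if counts else element[0]


def group_into_phonemes(segment):
    """Group a segment of (j, m) pairs into (phoneme, start, end) runs."""
    groups = []
    if not segment:
        return groups
    start, counts = 0, Counter()
    for i in range(len(segment) - 1):
        (j, m), (j_next, m_next) = segment[i], segment[i + 1]
        if j == j_next:
            counts[j] += 1  # substep 4, first condition
        if m == m_next:
            counts[m] += 1  # substep 4, second condition
        if j != j_next and m != m_next:
            # Substep 5: neither index carries over, so close the group.
            groups.append((_dominant(counts, segment[i]), start, i))
            start, counts = i + 1, Counter()
    groups.append((_dominant(counts, segment[-1]), start, len(segment) - 1))
    return groups


# Hypothetical data: phoneme 0 plays the role of an auxiliary phoneme.
sequence = [(1, 2), (2, 2), (2, 2), (2, 3), (5, 5), (5, 5), (5, 6),
            (6, 0), (0, 0), (0, 7), (7, 7), (7, 7)]
segments = split_at_auxiliary(sequence, auxiliary={0})
phonemes = [group_into_phonemes(segment) for segment in segments]
```

Each group is reported as the winning phoneme number together with the first and last positions it covers within its segment, so several subphonic elements collapse into a single stored entry.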

References

Below are listed all of the bibliographical references cited in the above description. All the following bibliographical references (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12) and (13) have been included herein for reference purposes.

-   -   (1) Rabiner, L. and Juang, B. H., “Fundamentals of Speech Recognition”, Prentice-Hall, Englewood Cliffs, N.J., 1993.
    -   (2) Levinson, S. E., Rabiner, L. R. and Sondhi, M. M., “An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition”, The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074.
    -   (3) Itakura, F., “Minimum Prediction Residual Principle Applied to Speech Recognition”, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-23, No. 1, February 1975, pp. 66-72.
    -   (4) Deng, Li and Ma, Jeff, “Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics”, Journal of the Acoustical Society of America, Vol. 108, No. 5, November 2000.
    -   (5) Ng, Corinna, Wilkinson, Ross and Zobel, Justin, “Experiments in spoken document retrieval using phoneme n-grams”, Speech Communication, Vol. 32, 2000, pp. 61-77.
    -   (6) Renals, S., Abberley, D., Kirby, D. and Robinson, T., “Indexing and retrieval of broadcast news”, Speech Communication, Vol. 32, 2000, pp. 5-20.
    -   (7) Johnson, S. E., Jourlin, P., Spärck Jones, K. and Woodland, P. C., “Spoken Document Retrieval for TREC-9 at Cambridge University”, Proceedings of the TREC-9 Conference, to be published.
    -   (8) Picone, J. W., “Signal Modeling Techniques in Speech Recognition”, Proceedings of the IEEE, Vol. 81, No. 9, September 1993, pp. 1215-1247.
    -   (9) Gray, R. M., Abut, H., “Full search and tree searched vector quantisation of waveforms”, Proceedings of the IEEE ICASSP, pp. 593-596, Paris, 1982.
    -   (10) Watanabe, T., “Pattern recognition with a tree structure used for reference pattern feature vectors or for HMM”, EP0627726, Nippon Electric Co., 1994.
    -   (11) Seide, F., “Method and system for pattern recognition based on tree organized probability densities”, U.S. Pat. No. 5,857,169, Philips Corp., 1999.
    -   (12) Linde, Y., Buzo, A., Gray, R. M., “An algorithm for vector quantiser design”, IEEE Transactions on Communications, pp. 84-95, January 1980.
    -   (13) Barnes, C. F., Frost, R. L., “Residual vector quantisers with jointly optimized codebooks”, Advances in Electronics and Electron Physics, 1991.

1. Voice recognition procedure which comprises: (a) a step of decomposing a digitised voice signal into a plurality of fractions, (b) a step of representation of each of the fractions by a representative vector X_(t), and (c) a step of classification of said representative vectors X_(t) in which each representative vector X_(t) is associated with a phonetic representation, which allows a sequence of phonetic representations to be obtained, characterised in that said classification step comprises at least one multistep binary tree residual vectorial quantisation.
2. Procedure according to claim 1, characterised in that said classification step comprises at least two successive vectorial quantisations.
3. Procedure according to claim 2, characterised in that said classification step comprises a first vectorial quantisation suitable for classifying each of said representative vectors X_(t) in a group from among 256 possible groups, and a second vectorial quantisation suitable for classifying each of said representative vectors X_(t) classified within each of said 256 groups in a subgroup from among at least 4096 possible subgroups, and preferably 16,777,216 possible subgroups, for each of said groups.
4. Procedure according to one of claims 2 or 3, characterised in that at least one of said vectorial quantisations is a multistep binary tree with symmetrical reflection residual vectorial quantisation.
5. Procedure according to at least one of claims 1 to 4, characterised in that said phonetic representation is a subphonic element.
6. Procedure according to at least one of claims 1 to 5, characterised in that said fractions are partially overlapped.
7. Procedure according to at least one of claims 1 to 6, characterised in that subsequent to said classification step there is a segmentation step which allows said phonetic representations to be joined to form groups of greater phonetic length.
8. Procedure according to claim 7, characterised in that said segmentation step comprises a group search of at least two subphonic elements which each comprise at least one auxiliary phoneme, and a grouping of the subphonic elements which are comprised between each pair of said groups which forms segments of subphonic elements.
9. Procedure according to one of claims 7 or 8, characterised in that said segmentation step comprises a step grouping subphonic elements into phonemes, in which said grouping step is performed on each of said segments of subphonic elements and comprises the following substeps:

-   -   1. Starting from the sequence of segments of subphonic elements $\{\Phi_{j,m}^{t}\},\ 1 \le t \le L$, in which L is the segment length.
    -   2. Initialise $i=1$.
    -   3. Initialise $s=i;\ e=i;\ n_{j}=0;\ n_{m}=0$ for $1 \le j \le 60;\ 1 \le m \le 60$.
    -   4. $\text{If}\quad\begin{cases} \{j \in \Phi_{j,m}^{i}\} = \{j \in \Phi_{j,m}^{i+1}\}; & n_{j} = n_{j} + 1 \\ \{m \in \Phi_{j,m}^{i}\} = \{m \in \Phi_{j,m}^{i+1}\}; & n_{m} = n_{m} + 1 \end{cases}$
    -   5. If $\{j \in \Phi_{j,m}^{i}\} \ne \{j \in \Phi_{j,m}^{i+1}\}$ and $\{m \in \Phi_{j,m}^{i}\} \ne \{m \in \Phi_{j,m}^{i+1}\}$ the following grouping is performed: $f = \operatorname{index\,max}\{n_{j}, n_{m};\ 1 \le j \le 60;\ 1 \le m \le 60\}$; $\{\Phi_{j,m}^{t}\};\ s \le t \le e \rightarrow \Phi_{f}$. Then i=i+1; if i<L−1 return to substep 3, otherwise finalise the segmentation.
    -   6. i=i+1; if i<L−1 return to substep 4, otherwise go to substep 5 and finalise the segmentation.

10. Procedure according to at least one of claims 1 to 9, characterised in that it comprises a learning step in which at least one known digitised voice signal is decomposed into a phoneme sequence and each phoneme is decomposed into a sequence of subphonic elements, and subsequently a subphonic element is assigned to each representative vector X_(t) according to the following rules:

-   -   1. $\Phi_{k-1}, \Phi_{k}, \Phi_{k+1}, \ldots$ being the phoneme sequence, in which the phoneme $\Phi_{k}$ is produced in the time segment $[t_{i}^{k}, t_{f}^{k}]$, in correspondence with the sequence of representative vectors $\{X_{t}\}$.
    -   2. The representative vectors $\{X_{t}\}$ are assigned to subphonic units according to the rule: $\begin{cases} X_{t} \rightarrow /\Phi_{k-1}\_\Phi_{k}/; & t_{i}^{k} < t \le t_{i}^{k} + 0.2\,(t_{f}^{k} - t_{i}^{k}) \\ X_{t} \rightarrow /\Phi_{k}\_\Phi_{k}/; & t_{i}^{k} + 0.2\,(t_{f}^{k} - t_{i}^{k}) < t \le t_{i}^{k} + 0.8\,(t_{f}^{k} - t_{i}^{k}) \\ X_{t} \rightarrow /\Phi_{k}\_\Phi_{k+1}/; & t_{i}^{k} + 0.8\,(t_{f}^{k} - t_{i}^{k}) < t \le t_{f}^{k} \end{cases}$

11. Procedure according to at least one of claims 1 to 10, characterised in that it comprises a step of reduction of the residual vectorial quantisation tree which comprises the following substeps:

-   -   1. An initial value is given to p=number of steps.
    -   2. The branches of the residual vector quantisations which are situated in step p are taken, i.e., the vectors $c_{j^{p}}$ such that length($j^{p}$)=p.
    -   3. If the vector $c_{j^{p-1}\_0}$ and the vector $c_{j^{p-1}\_1}$ are both associated with the same subphonic element $\Phi_{j,m}$, step p is discarded and the subphonic element $\Phi_{j,m}$ is associated with the vector $c_{j^{p-1}}$.
    -   4. If p>2, p=p−1 is taken and substep 2 is repeated.

12. Procedure according to claim 11, characterised in that said step of reduction of the residual vectorial quantisation tree is performed subsequently to said learning step.
13. Procedure according to at least one of claims 1 to 12, characterised in that said representative vector is of 39 dimensions, which are 12 normalised Mel-cepstral coefficients, the energy in logarithmic scale, and its first and second order derivatives.
14. Information technology system which comprises an execution environment suitable for executing an information technology programme, characterised in that it comprises voice recognition means through at least one multistep binary tree residual vectorial quantisation according to at least one of claims 1 to 13.
15. Information technology programme which can be loaded directly into the internal memory of a computer, characterised in that it comprises instructions suitable for performing a procedure according to at least one of claims 1 to 13.
16. Information technology programme stored in a medium suitable to be used by a computer, characterised in that it comprises instructions suitable for performing a procedure according to at least one of claims 1 to 13.